Biological data processing

ABSTRACT

A multi-database query system which queries a plurality of databases and servers, including an input which receives queries in a structured form and a translation server which translates at least a part of a received query into commands recognized by a data manipulation server.

This application claims the benefit of Provisional Application No. 60/141,424, filed Jun. 29, 1999.

FIELD OF THE INVENTION

The present invention relates to automated database searching and in particular to automated access to biological databases.

BACKGROUND OF THE INVENTION

One of the tasks performed in biological research is comparison of newly discovered biological data with data stored in databases. Over two hundred public biological databases are available around the world, many on the Internet. In general, databases include a plurality of records which have the form of an object class. The object class is formed of a plurality of fields, often in a hierarchy in which an object class includes one or more sub-object classes which in turn may include sub-sub-object classes. The records may represent, for example, gene sequences and may have fields which include various data about the sequences, such as their length, origin and a view of the sequence. Information is extracted from databases by querying a management system associated with the database. A simple query includes a request to display one or more fields of records which fulfill certain criteria.

The existing databases have different organization methodologies, e.g., different fields in each record and different query schemes. In order to access these databases with ease, an Object Protocol Model (OPM) suite of tools was developed. An OPM processor mediates between a user and databases associated with the OPM suite. A common organization methodology is used to represent the data in all the databases accessed via the OPM processor. Queries addressed to databases via the OPM processor are provided by a user to the OPM processor in a structured form expressed in accordance with the common organization methodology. The OPM processor translates the queries from the structured OPM form to query forms compatible with the management systems of the specific databases to which the queries are addressed. The results from the specific databases are returned to the OPM processor, which translates the results back to the organization methodology of the OPM suite. Not only does the OPM suite allow a user to access a plurality of different databases in different forms, it also allows the user to access a plurality of databases using a single query. For example, a complex query may request to display the records from a first database which have a gene length greater than that of corresponding records of a second database which represent the same organism.

The use of a common organization methodology across databases allows using special tools for more easily generating queries and/or performing more complex queries. For example, a graphic user interface (GUI) of the OPM suite allows the user to prepare a query in a structured manner.

Some of the forms of biological data are complex data structures, such as gene sequences, which require special procedures for manipulation, for example, for performing comparisons. Homology search engines, such as BLAST, are used to compare gene sequences. When a user wants to compare, for example, all the gene sequences classified in a certain month to one or more groups of gene sequences, the user retrieves all the desired classified gene sequences using OPM. Then, the user passes the retrieved data to a homology sequence server which performs the sequence comparison.

SUMMARY OF THE INVENTION

One aspect of some embodiments of the invention provides a method for accessing data manipulation servers using a structured query format used to query databases. Optionally, the accessing of manipulation servers is integrated with the accessing of database information, for example by manipulating the results of the data access and/or by using the results of the data manipulation as data to be accessed or for restricting queries.

One aspect of some embodiments of the present invention relates to a multi-database query system which receives queries which relate to both database and data manipulation servers, such as homology search engines. The queries relate to the data manipulation servers as if they are database servers, allowing use of any tool of the multi-database query system developed for database queries, on queries which access data manipulation servers. Such tools include, for example, database linking tools, graphic query preparation tools and query optimization tools. By relating to databases and data manipulation servers from a single query, the data manipulation server may process results from the database as they are provided, before the database runs through all its records. Alternatively or additionally, the results of a data manipulation step may be further queried. Thus, the response time required for a complex query may be substantially reduced. Alternatively or additionally, the amount of traffic on a network may be reduced and/or better spread out in time. Also, complex operations may require less user intervention.

In some embodiments of the present invention, the input to and/or output from the data manipulation servers are modeled by structured objects. The modeled input objects may result from processing other sections of the query. The modeled output objects may be further processed by other sections of the query or even further manipulated by other (or the same) manipulation servers.

In some embodiments of the invention, each data manipulation server associated with the query system has a translation server which mediates between the data manipulation server and the query system. The translation server receives commands from the query system in a structured query form used by the query system and translates the commands to a form in which the data manipulation server receives commands. The translation server optionally also receives results from the data manipulation server and presents the results to the query system in objects organized according to structured object classes used by the query system.

There is thus provided in accordance with an embodiment of the invention, a multi-database query system which queries a plurality of databases and servers, including an input which receives queries in a structured form, and a translation server which translates at least a part of a received query into commands recognized by a data manipulation server.

Optionally, the system comprises a processor which parses the received query into parts according to the databases and servers to which they relate. Alternatively or additionally, the structured form comprises a form used to query databases. Alternatively or additionally, the input receives a query which relates to at least one database and at least one data manipulation server. Alternatively or additionally, the translation server models results from the data manipulation server into database objects. Alternatively or additionally, the data manipulation server comprises a server which receives input from at least two different sources. Optionally, the data manipulation server comprises a homology comparison engine.

There is also provided in accordance with an embodiment of the invention, a method of accessing a data manipulation server from a multi-database query system, including providing the query system with a query which includes a first directive assigning a value to at least one field of an input object associated with the data manipulation server and a second directive which determines a value of at least one field of an output object associated with the data manipulation server, and invoking the data manipulation server responsive to the second directive. Optionally, providing the query comprises preparing the query using a graphical interface designed for querying structured databases. Alternatively or additionally, the data manipulation server comprises a homology engine.

There is also provided in accordance with an embodiment of the invention, a method of performing a database search using a multi-database query system, including providing the query system with a query which includes at least one directive related to a database and at least one directive related to a data manipulation server, wherein the directives are stated in an identical structural format, translating the directives into commands recognized by the database and the data manipulation server, and submitting the commands respectively to the data manipulation server and to the database.

Optionally, the data manipulation server comprises a homology comparison engine. Alternatively or additionally, translating the directives comprises identifying, by a query processor, the directives directed to the database and the directives directed to the data manipulation server. Optionally, translating the directives comprises passing the directives to translation servers associated with the database or data manipulation server to which the directives are directed. Alternatively or additionally, the method comprises determining an order in which the directives are to be processed and submitting the translated directives to the data manipulation server and to the database according to the determined order.

In some embodiments, the method comprises receiving results from said submission and translating the results into structured objects. Optionally, translating the results into structured objects comprises translating the results to structured objects related to the directives.

Alternatively or additionally, providing a query comprises providing a query in an Object Protocol Model (OPM)-like language.

BRIEF DESCRIPTION OF FIGURES

Particular embodiments of the invention will be described with reference to the following description of embodiments in conjunction with the figures, wherein identical structures, elements or parts which appear in more than one figure are preferably labeled with a same or similar number in all the figures in which they appear, in which:

FIG. 1 is a schematic illustration of a multi-database query system, in accordance with an embodiment of the invention; and

FIG. 2 is a flowchart of the actions performed by the multi-database query system of FIG. 1, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 is a schematic illustration of a multi-database query system 20, in accordance with an embodiment of the invention. System 20 mediates between an end-user 22 and a plurality of service providers which include databases 24 and one or more data manipulation servers, such as a homology search engine 26. Error detection processes are another example of data manipulation servers. Engine 26 is a data manipulation server in that it provides processing services and is not primarily used for storing and providing information. In some embodiments of the invention, engine 26 does not store information and a user requesting processing services must provide the information to be processed or must provide a link to a database or file containing the information. Data manipulation servers may receive a single input of data, e.g., error detection processes which receive a single sequence, or a plurality of inputs, e.g., homology engines which compare sequences from two different sources. One of the objects of some embodiments of the invention is to allow end-user 22 to relate to homology engine 26 and/or to other data manipulation servers as if they were databases 24.

Databases 24 may be organized differently from each other and are not generally controllable by a supervisor of system 20. End-user 22 provides system 20 with queries in a query language of system 20, for example a structured query language, such as OPM. In some embodiments of the invention, a single query may be directed to more than one service provider. For example, a single query may be directed to a plurality of databases 24 and to homology engine 26.

In some embodiments of the invention, system 20 comprises a graphical user interface 28 which receives queries in a graphical form and translates them into the system's query language. Alternatively or additionally, system 20 comprises a command-line interface 30 which receives commands from end-user 22 directly in the system's query language or possibly using natural language. Further alternatively or additionally, system 20 comprises a remote-unit interface 32 which receives queries from remote computer units.

System 20 further comprises a multi-database query processor 34 which receives queries from interfaces 28, 30 and/or 32 and processes them, as described hereinbelow. In some embodiments of the invention, query processor 34 and interfaces 28, 30 and/or 32 are implemented in software on a single computer 36 accessible to end-user 22. Alternatively, a distributed configuration is used.

In some embodiments of the invention, system 20 further comprises, for each database 24, an OPM translation server 38 that mediates between processor 34 and the respective service provider. In some embodiments of the invention, translation servers 38 translate queries from the query language of system 20 into query languages supported by the respective database 24. Optionally, translation servers 38 translate query results received from the databases 24 into the structural object classes of system 20.

In a similar manner, system 20 comprises an OPM translation server 42 which mediates between processor 34 and homology engine 26. In some embodiments of the invention, translation server 42 translates query portions from the query language of system 20 into commands supported by homology engine 26. That is, the OPM language allows, in accordance with embodiments of the invention, phrasing queries that access homology engine 26 as a database. Translation server 42 translates query directives, such as limitations, into commands to be performed by homology engine 26. In addition, translation server 42 optionally translates the output from homology engine 26 into structural objects, in accordance with the query language used by system 20. An exemplary structural definition of objects used to access a homology engine from the OPM suite is described in Table 1.

TABLE 1

SCHEMA blast_srv
DESCRIPTION: “The OPM schema for a queryable blast server”

CONTROLLED VALUE CLASS BlastEngine_Cv
    {“wu_blast 2.0”, “ncbi_blast 2.0”}
    DEFAULT: “wu_blast 2.0”

CONTROLLED VALUE CLASS BlastProgram_Cv
    {“blastn”, “blastx”, “blastp”, “tblastn”, “tblastx”}
    DEFAULT: “blastn”

CONTROLLED VALUE CLASS Strand_Cv
    {“top”, “bottom”, “both”}
    DEFAULT: “both”

CONTROLLED VALUE CLASS SortBy_Cv
    {“pvalue”, “count”, “highscore”, “totalscore”}
    DEFAULT: “pvalue”

CONTROLLED VALUE CLASS GenCode_Cv
    {(“Standard or Universal”, 1), (“Vertebrate Mitochondrial”, 2),
     (“Yeast Mitochondrial”, 3), (“Mold, Protozan, ...”, 4),
     (“Invertebrate Mitochondrial”, 5), (“Ciliate Macronuclear”, 6),
     (“Encinodermate Mitochondrial”, 9), (“Alternative Ciliate Macronuclear”, 10),
     (“Eubactrial”, 11), (“Alternative Yeast”, 12),
     (“Ascidian Mitochondrial”, 13), (“Flatworm Mitochondrial”, 14)}
    DEFAULT: “Standard or Universal”
    CODE_TYPE: SMALLINT

CONTROLLED VALUE CLASS Filter_Cv
    {(“none”, 0), (“seg”, 1), (“xnu”, 2), (“seg+xnu”, 3), (“dust”, 4)}
    DEFAULT: “none”
    CODE_TYPE: SMALLINT

CONTROLLED VALUE CLASS Matrix_Cv
    {(“blosum62”, 0), (“blosum35”, 1), (“blosum40”, 2), (“blosum45”, 3),
     (“blosum50”, 4), (“blosum65”, 5), (“blosum70”, 6), (“blosum75”, 7),
     (“blosum80”, 8), (“blosum85”, 9), (“blosum95”, 10), (“blosum100”, 11),
     (“GONNET”, 12), (“pam10”, 13), (“pam20”, 14), (“pam30”, 15),
     (“pam40”, 16), (“pam50”, 17), (“pam60”, 18), (“pam70”, 19),
     (“pam80”, 20), (“pam90”, 21), (“pam100”, 22), (“pam110”, 23),
     (“pam120”, 24), (“pam130”, 25), (“pam140”, 26), (“pam150”, 27),
     (“pam160”, 28), (“pam170”, 29), (“pam180”, 30), (“pam190”, 31),
     (“pam200”, 32), (“pam210”, 33), (“pam220”, 34), (“pam230”, 35),
     (“pam240”, 36), (“pam250”, 37), (“pam260”, 38), (“pam270”, 39),
     (“pam280”, 40), (“pam290”, 41), (“pam300”, 42), (“pam310”, 43),
     (“pam320”, 44), (“pam330”, 45), (“pam340”, 46), (“pam350”, 47),
     (“pam360”, 48), (“pam370”, 49), (“pam380”, 50), (“pam390”, 51),
     (“pam400”, 52), (“pam410”, 53), (“pam420”, 54), (“pam430”, 55),
     (“pam440”, 56), (“pam450”, 57)}
    DEFAULT: “blosum62”
    CODE_TYPE: SMALLINT

CONTROLLED VALUE CLASS DB_Cv
    {“testdb”, “localdb”, “dbest”}
    DEFAULT: “testdb”

OBJECT CLASS Blast_Call
    DESCRIPTION: “A blast call object represents a particular homology search using a blast engine”
    ID: callId
    ATTRIBUTE callId: INTEGER REQUIRED
    ATTRIBUTE engine: BlastEngine_Cv REQUIRED
    ATTRIBUTE program: BlastProgram_Cv REQUIRED
    ATTRIBUTE query: VARCHAR(2000) REQUIRED
    ATTRIBUTE datasource: DB_Cv REQUIRED
    ATTRIBUTE output: set-of[1,] Blast_Output REQUIRED
    ATTRIBUTE matrix: Matrix_Cv OPTIONAL
    ATTRIBUTE strand: Strand_Cv OPTIONAL
    ATTRIBUTE sortby: SortBy_Cv OPTIONAL
    ATTRIBUTE dbgcode: GenCode_Cv OPTIONAL
    ATTRIBUTE filter: Filter_Cv OPTIONAL
    ATTRIBUTE threshold: REAL OPTIONAL
    ATTRIBUTE alignments: INTEGER OPTIONAL
    ATTRIBUTE scores: INTEGER OPTIONAL
    ATTRIBUTE param_E: REAL OPTIONAL
    ATTRIBUTE param_S: REAL OPTIONAL
    ATTRIBUTE param_E2: REAL OPTIONAL
    ATTRIBUTE param_S2: REAL OPTIONAL
    ATTRIBUTE param_W: INTEGER OPTIONAL
    ATTRIBUTE param_T: INTEGER OPTIONAL
    ATTRIBUTE param_X: INTEGER OPTIONAL
    ATTRIBUTE param_N: INTEGER OPTIONAL
    ATTRIBUTE param_M: INTEGER OPTIONAL
    ATTRIBUTE param_B: INTEGER OPTIONAL
    ATTRIBUTE param_V: INTEGER OPTIONAL

OBJECT CLASS Blast_Output
    DESCRIPTION: “The output of a specific blast call”
    ID: runId
    ATTRIBUTE runId: INTEGER REQUIRED
    ATTRIBUTE program: VARCHAR(8) REQUIRED
    ATTRIBUTE version: VARCHAR(20) REQUIRED
    ATTRIBUTE revision: VARCHAR(20) REQUIRED
    ATTRIBUTE build: VARCHAR(40) REQUIRED
    ATTRIBUTE queryId: VARCHAR(20) REQUIRED
    ATTRIBUTE querySeq: VARCHAR(2000) REQUIRED
    ATTRIBUTE queryLength: INTEGER REQUIRED
    ATTRIBUTE database: DB_Cv REQUIRED
    ATTRIBUTE hits: set-of[1,] BlastHits REQUIRED
    ATTRIBUTE dbSize_Seqs: INTEGER REQUIRED
    ATTRIBUTE dbSize_Letters: INTEGER REQUIRED
    ATTRIBUTE dbFile: VARCHAR(80) REQUIRED
    ATTRIBUTE dbReleased: VARCHAR(40) REQUIRED
    ATTRIBUTE dbPosted: VARCHAR(40) REQUIRED
    ATTRIBUTE hitSatE: INTEGER REQUIRED
    ATTRIBUTE searchTime: VARCHAR(40) REQUIRED
    ATTRIBUTE totalTime: VARCHAR(40) REQUIRED
    ATTRIBUTE runDate: VARCHAR(40) REQUIRED
    ATTRIBUTE parameters: set-of[1,] OutputParameters REQUIRED

OBJECT CLASS OutputParameters
    ID: paramId
    ATTRIBUTE paramId: INTEGER REQUIRED
    ATTRIBUTE strand: VARCHAR(10) REQUIRED
    ATTRIBUTE frame: VARCHAR(10) REQUIRED
    ATTRIBUTE matrixId: VARCHAR(10) REQUIRED
    ATTRIBUTE matrixName: VARCHAR(10) REQUIRED
    ATTRIBUTE lamdba_Used: VARCHAR(10) REQUIRED
    ATTRIBUTE K_Used: VARCHAR(10) REQUIRED
    ATTRIBUTE H_Used: VARCHAR(10) REQUIRED
    ATTRIBUTE lamdba_Computed: VARCHAR(10) REQUIRED
    ATTRIBUTE K_Computed: VARCHAR(10) REQUIRED
    ATTRIBUTE H_Computed: VARCHAR(10) REQUIRED
    ATTRIBUTE param_E1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_S1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_W1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_T1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_X1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_E2: VARCHAR(10) REQUIRED
    ATTRIBUTE param_S2: VARCHAR(10) REQUIRED

OBJECT CLASS BlastHeader
    DESCRIPTION: “The header section of BLAST output”
    ID: headerId
    ATTRIBUTE headerId: INTEGER REQUIRED
    ATTRIBUTE program: VARCHAR(8) REQUIRED
    ATTRIBUTE version: VARCHAR(20) REQUIRED
    ATTRIBUTE revision: VARCHAR(20) REQUIRED
    ATTRIBUTE build: VARCHAR(40) REQUIRED
    ATTRIBUTE queryId: VARCHAR(20) REQUIRED
    ATTRIBUTE querySeq: VARCHAR(2000) REQUIRED
    ATTRIBUTE database: DB_Cv REQUIRED
    ATTRIBUTE numOfSequences: INTEGER REQUIRED
    ATTRIBUTE numOfLetters: INTEGER REQUIRED

OBJECT CLASS BlastHits
    DESCRIPTION: “Blast Hits”
    ID: accession
    ATTRIBUTE accession: VARCHAR(12) REQUIRED
    ATTRIBUTE description: VARCHAR(255) REQUIRED
    ATTRIBUTE score: INTEGER REQUIRED
    ATTRIBUTE pvalue: REAL REQUIRED
    ATTRIBUTE num: INTEGER REQUIRED
    ATTRIBUTE length: INTEGER OPTIONAL
    ATTRIBUTE hsp: set-of[1,] BlastHSP OPTIONAL

OBJECT CLASS BlastHSP
    ID: hspId
    ATTRIBUTE hspId: INTEGER REQUIRED
    ATTRIBUTE score: INTEGER REQUIRED
    ATTRIBUTE expect: REAL REQUIRED
    ATTRIBUTE pvalue: REAL REQUIRED
    ATTRIBUTE strand1: VARCHAR(1) REQUIRED
    ATTRIBUTE strand2: VARCHAR(1) REQUIRED
    ATTRIBUTE identities: REAL REQUIRED
    ATTRIBUTE positives: REAL REQUIRED
    ATTRIBUTE query (sequence, begin, end): (VARCHAR(500) REQUIRED, INTEGER REQUIRED, INTEGER REQUIRED)
    ATTRIBUTE target (sequence, begin, end): (VARCHAR(500) REQUIRED, INTEGER REQUIRED, INTEGER REQUIRED)
    ATTRIBUTE align: VARCHAR(500) REQUIRED
    ATTRIBUTE t5_begin: INTEGER REQUIRED
    ATTRIBUTE t5_end: INTEGER REQUIRED

The structural definition of Table 1 is written in a language used to define OPM objects, described for example in Chen, I. A.; Kosky, A. S.; Markowitz, V. M.; Szeto, E.; and Topaloglou, T., 1998. “Advanced Query Mechanisms for Biological Databases” in Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology (ISMB'98), the disclosure of which is incorporated herein by reference.
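By way of illustration only, the controlled value classes and part of the “Blast_Call” object class of Table 1 might be mirrored in a general-purpose language roughly as follows. This is a minimal Python sketch, not part of the OPM suite: the names `check_cv` and `BlastCall` are hypothetical, only a subset of the attributes is shown, and leaving `output` empty until the engine is invoked is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

# Controlled value classes from Table 1, as (allowed values, default) pairs.
ControlledValues = Tuple[Sequence[str], str]
BLAST_ENGINE_CV: ControlledValues = (("wu_blast 2.0", "ncbi_blast 2.0"), "wu_blast 2.0")
BLAST_PROGRAM_CV: ControlledValues = (
    ("blastn", "blastx", "blastp", "tblastn", "tblastx"), "blastn")
DB_CV: ControlledValues = (("testdb", "localdb", "dbest"), "testdb")


def check_cv(cv: ControlledValues, value: str) -> str:
    """Return value if the controlled value class allows it, else raise."""
    allowed, _default = cv
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {tuple(allowed)}")
    return value


@dataclass
class BlastCall:
    """Hypothetical mirror of a subset of OBJECT CLASS Blast_Call."""
    call_id: int                           # callId: INTEGER REQUIRED
    query: str                             # query: VARCHAR(2000) REQUIRED
    engine: str = BLAST_ENGINE_CV[1]       # defaults taken from Table 1
    program: str = BLAST_PROGRAM_CV[1]
    datasource: str = DB_CV[1]
    threshold: Optional[float] = None      # threshold: REAL OPTIONAL
    output: List[dict] = field(default_factory=list)  # filled in by the engine

    def __post_init__(self) -> None:
        check_cv(BLAST_ENGINE_CV, self.engine)
        check_cv(BLAST_PROGRAM_CV, self.program)
        check_cv(DB_CV, self.datasource)
        if len(self.query) > 2000:
            raise ValueError("query exceeds VARCHAR(2000)")
```

A call object built with only the required identifiers picks up the defaults of the controlled value classes, and a value outside a controlled value class is rejected at construction time.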

Alternatively or additionally, a single translation server 38 may be used for more than one service provider. Alternatively or additionally, OPM processor 34 performs some or all of the translation tasks of translation servers 38 and 42. In some embodiments of the invention, OPM servers 38 and 42 are situated on the same computer as their respective service providers 24 and 26. Alternatively, OPM servers 38 and 42 are located on computers proximal to their respective service providers 24 and 26, although translation servers may be located substantially anywhere.

In some embodiments of the invention, a multi-database directory 40 is used by processor 34 to determine to which of service providers 24 and 26 the portions of a query are directed. Directory 40 summarizes the contents, organization methodologies and capabilities of databases 24 and engines 26. In some embodiments, a single directory is used for a plurality of query processors 34, such that adding additional service providers to system 20 requires only preparing a respective OPM server for the additional service providers and updating directory 40, while no changes are needed in processors 34.
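The role of directory 40 can be pictured with a small lookup table. The following Python sketch is illustrative only: `ProviderEntry`, `DIRECTORY`, `resolve` and the server labels are hypothetical names, and a real directory would also carry the contents and capabilities information mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ProviderEntry:
    kind: str                # "database" or "manipulation"
    translation_server: str  # hypothetical label of the responsible OPM server


# Hypothetical contents of directory 40 for the providers named in Table 2.
DIRECTORY: Dict[str, ProviderEntry] = {
    "local": ProviderEntry("database", "opm-server-38"),
    "blast": ProviderEntry("manipulation", "opm-server-42"),
}


def resolve(class_ref: str,
            directory: Dict[str, ProviderEntry] = DIRECTORY
            ) -> Tuple[str, str, ProviderEntry]:
    """Map a reference such as 'blast:Blast_Call' to its directory entry."""
    provider, _, object_class = class_ref.partition(":")
    return provider, object_class, directory[provider]
```

Under these assumptions, adding a service provider amounts to adding one entry to the table; `resolve` itself, standing in for the query processor's lookup, is unchanged.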

In some embodiments of the present invention, the various components of system 20 interact using a distributed-object technology, such as the Common Object Request Broker Architecture (CORBA), which is described, for example, on the web site of the “Object Management Group” (OMG) at www.omg.org, as available on Jun. 27, 1999. The disclosure of this web site is incorporated herein by reference. In some embodiments of the invention, a plurality of different CORBA interfaces are used in system 20 for different types of interactions between the components of system 20. In one example, a first CORBA interface is used for programming and a second interface is used for object transfer and/or sharing. Optionally, remote-unit interface 32 also comprises a CORBA interface.

Alternatively or additionally, other distributed-object technologies, such as Microsoft's Component Object Model (COM) or the UNIX environment remote procedure call (RPC), may be used to allow remote and/or non-remote components of system 20 to interact. Further alternatively or additionally, system 20 may be implemented in its entirety by a single process and/or on a single processor.

TABLE 2

(1)  SELECT l = r.fragId, a = h.accessor
(2)  FROM   r in local:Fragments
(3)         bc in blast:Blast_Call
(4)         bo in bc.output
(5)         h = bo.summary.sequence
(6)  WHERE  r.finished = “today” and
(7)         bc.querySeq = r.sequence and
(8)         bc.command = “blastn” and
(9)         bc.dataSource = “dbEST” and
(10)        h.length > 300

Table 2 illustrates a sample query received by query processor 34 from any of interfaces 28, 30 and 32. The query in Table 2 is written according to the OPM query language described, for example, in the ISMB'98 publication referenced hereinabove. This OPM query language allows accessing a plurality of databases 24 from a single query. The query of Table 2 relates to both a database 24 and a homology engine 26, the homology engine being accessed as if it were a database.

The query in Table 2 is built of three sections. A first section, labeled SELECT, states the fields which are to appear in the output generated responsive to the query. In Table 2 these fields are a “fragId” field of a variable r, and an “accessor” field of a variable h (the variables r and h are defined in the second section). A second section, labeled FROM, defines the variables mentioned in the query by stating the database object classes to which they relate. That is, the second section states which objects are candidates for fulfilling the query.

In Table 2, the variable r, for example, corresponds to a “Fragments” object class in a database named “local”. In the same way, a dummy variable “bc” corresponds to an object class named “Blast_Call” in a pseudo database “blast”. However, unlike variable r, which represents an actual field of data in a database 24, variable “bc” does not represent any such field, and a database “blast” does not actually exist.

Rather, when the “blast” database is referred to in a query, processor 34 refers to homology engine 26. In some embodiments of the invention, translation server 42 performs any required translations to the input and output of homology engine 26, such that the homology engine appears to processor 34 as a database. In an exemplary embodiment of the present invention, the entire interface with homology engine 26 is structured in a single translation object, for example, in accordance with the “Blast_Call” object class in Table 2, which is defined in Table 1. The translation object includes the input to and output from homology engine 26. For example, the “Blast_Call” object class has fields which relate to the commands to engine 26, such as a “command” field which states the type of command performed by engine 26, a “querySeq” field which states an input sequence to be compared by the engine and a “dataSource” field which states a database of sequences to which the input sequence is compared. In addition, the “Blast_Call” object class has an “output” field into which the output from homology engine 26 is preferably structurally stored. In the query of Table 2, a dummy variable, “bo”, refers to the sub-object “output”, thus simplifying the query statements.
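The mediation performed by a translation server such as server 42 can be sketched as two steps: turning the constraints on the input fields of the translation object into a request for the engine, and folding the raw engine output back into structured objects. The Python below is a hedged illustration only. The request layout, the tab-separated result lines and the functions `translate_call`, `fake_engine` and `run_blast_call` are all assumptions made for the sketch; they do not describe the interface of any actual BLAST server.

```python
from typing import Callable, Dict, List

REQUIRED_INPUTS = ("command", "querySeq", "dataSource")  # field names as used in Table 2


def translate_call(constraints: Dict[str, str]) -> Dict[str, str]:
    """Turn equality constraints on Blast_Call fields into an engine request."""
    missing = [name for name in REQUIRED_INPUTS if name not in constraints]
    if missing:
        raise ValueError(f"unbound input fields: {missing}")
    return {
        "program": constraints["command"],
        "sequence": constraints["querySeq"],
        "database": constraints["dataSource"],
    }


def fake_engine(request: Dict[str, str]) -> List[str]:
    """Stand-in for homology engine 26; returns made-up 'accessor<TAB>length' lines."""
    return ["EST001\t412", "EST002\t120"]


def run_blast_call(constraints: Dict[str, str],
                   engine: Callable[[Dict[str, str]], List[str]] = fake_engine) -> dict:
    """Invoke the engine and store its output in the 'output' field of the call object."""
    request = translate_call(constraints)
    hits = []
    for line in engine(request):
        accessor, length = line.split("\t")
        hits.append({"accessor": accessor, "length": int(length)})
    return {**constraints, "output": hits}
```

The query processor then sees a single object whose input fields echo the constraints and whose “output” field can be searched like any database record.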

When a query relates to an action, such as a search or a filter to be performed in a pseudo database, processor 34 first has the respective engine 26 perform any required commands to fill up the output fields of the object representing the pseudo database, e.g., “Blast_Call”, and only then is the search performed. Alternatively or additionally, as the output records become available from homology engine 26 they are sent for further processing. In some cases, the records can be processed even before all the fields are available from engine 26. One example of a query optimization as applied to data manipulation servers is that the query translator instructs the engine to prepare only those result fields which are actually required for further processing or display. Another example of optimization is allowing some of the fields to be provided at a later time than other fields. Modifying the order of generation of fields, even between records, may be useful if some fields are required for further data manipulation or for querying against a slow database and are thus time critical. For some types of data manipulation, it may even be useful to start the manipulation with only part of the fields and then repeat the manipulation with the rest of the fields. One example where it is useful to start manipulating before all the fields are available is where the manipulation can be carried out, at least to some extent, without the field or where the value of the field or the range of possible values of the field can be known. Thus, for example, a DNA homology can be failed based on both of the strands not matching, even before it is known which strand needs to be matched. Once the strand information is available, the group of accepted matches can be further limited using that information.
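The first optimization mentioned above, preparing only the result fields that are actually required, presupposes that the translator can tell which fields of a variable the query uses. A deliberately naive way to find them is to scan the query text for references of the form `variable.field`. The Python sketch below assumes the query is available as plain text; `required_fields` is a hypothetical helper, and a real translator would work on a parsed query rather than on a regular expression.

```python
import re
from typing import List

# The query of Table 2, flattened to one string for the purpose of this sketch.
QUERY_TEXT = (
    'SELECT l = r.fragId, a = h.accessor '
    'FROM r in local:Fragments bc in blast:Blast_Call '
    'bo in bc.output h = bo.summary.sequence '
    'WHERE r.finished = "today" and bc.querySeq = r.sequence and '
    'bc.command = "blastn" and bc.dataSource = "dbEST" and h.length > 300'
)


def required_fields(query_text: str, variable: str) -> List[str]:
    """List the fields of `variable` that the query refers to directly."""
    pattern = rf"\b{re.escape(variable)}\.(\w+)"
    return sorted(set(re.findall(pattern, query_text)))
```

For the query of Table 2, only the “accessor” and “length” fields of variable h are referenced, so under this scheme the engine would not need to prepare any other field of the hit records.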

Thus, system 20 can have different parts of a query evaluated in parallel, in particular, time consuming parts performed by data manipulation servers. For example, homology engine 26 may begin to operate as records from another part of a query become available, and/or the output from engine 26 may be processed as it is provided, without waiting for all the results. This parallelism is possible because homology engine 26 is accessed from within the query. An advantage of some embodiments of the invention is the savings in response time and in communication and CPU resources of complex queries due to this parallelism.
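The record-at-a-time hand-off described above can be illustrated with Python generators. The sketch below is single-threaded, so it shows only the interleaving (each record reaches the engine stage before the database stage produces the next one), not true concurrent execution; the record values and the function names are made up for the illustration.

```python
from typing import Dict, Iterable, Iterator, List


def fragments_finished_today(log: List[str]) -> Iterator[Dict[str, str]]:
    """Stand-in for the database part of the query; yields records one by one."""
    for frag_id, sequence in [("f1", "ACGT"), ("f2", "GGCC")]:
        log.append(f"database yields {frag_id}")
        yield {"fragId": frag_id, "sequence": sequence}


def blast_each(records: Iterable[Dict[str, str]], log: List[str]) -> Iterator[Dict[str, str]]:
    """Stand-in for the engine part; consumes each record as soon as it arrives."""
    for record in records:
        log.append(f"engine receives {record['fragId']}")
        yield {"fragId": record["fragId"], "querySeq": record["sequence"]}


log: List[str] = []
results = list(blast_each(fragments_finished_today(log), log))
```

After running the pipeline, `log` alternates between database and engine entries, showing that the second stage did not wait for the first stage to finish.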

In some cases, such parallel processing of data manipulation may require the data manipulation server or the data manipulation program itself to be modified to take the timing information into account. In one example, a blast server may associate the actual partial information used with a result record set, so that it can further limit the search results after the fact.

A third section of the query, labeled WHERE, states the conditions to be fulfilled by those objects selected by the query. In Table 2 these conditions include that a field named “finished” of the variable r must have a value “today”, a field “querySeq” of the variable bc must have a value equal to the value of the field “sequence” of variable r, etc. In this section, the conditions on database objects and on pseudo database objects are stated substantially in the same way.

FIG. 2 is a flowchart of the actions performed in processing a query by system 20, in accordance with an embodiment of the present invention. Upon receiving a query, such as the query in Table 2, processor 34 divides (60) the query into parts which are performed by the various service providers 24 and 26. Processor 34 determines, for example using methods known in the art, to which service provider each line in the query is directed. In an exemplary embodiment of the present invention, the determination is performed by reference to directory 40. In the query of Table 2, processor 34 determines from the second line that variable r is to be searched in the database 24 named “local”. From the third line it is determined that variable bc is to be “searched” in engine 26 named “blast”. Therefore, lines 2 and 6 of the query are directed to the database “local” and lines 3, 7, 8 and 9 are directed to homology engine 26. Lines 1, 4, 5 and 10 do not refer to any database and therefore they are processed by processor 34.
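The division step (60) can be reproduced on the query of Table 2 with a few lines of code. The Python sketch below is one possible reading of the rule described above, stated as an assumption: a line that binds a variable to `provider:Class` belongs to that provider, a condition belongs to the provider of the variable on its left-hand side if that variable is bound directly to a provider, and every other line stays with the processor. The regular expressions handle only the syntax that appears in Table 2.

```python
import re
from typing import Dict, List

QUERY: Dict[int, str] = {
    1: 'SELECT l = r.fragId, a = h.accessor',
    2: 'FROM r in local:Fragments',
    3: 'bc in blast:Blast_Call',
    4: 'bo in bc.output',
    5: 'h = bo.summary.sequence',
    6: 'WHERE r.finished = "today" and',
    7: 'bc.querySeq = r.sequence and',
    8: 'bc.command = "blastn" and',
    9: 'bc.dataSource = "dbEST" and',
    10: 'h.length > 300',
}

# "x in provider:Class" binds variable x to a service provider.
BINDING = re.compile(r"^(?:FROM\s+)?(\w+) in (\w+):(\w+)$")
# "x.field = ..." (or < / >) is a condition on variable x.
CONDITION = re.compile(r"^(?:WHERE\s+)?(\w+)\.\w+\s*[=<>]")


def partition(query: Dict[int, str]) -> Dict[str, List[int]]:
    """Group query line numbers by the party that must evaluate them."""
    owner: Dict[str, str] = {}        # variable -> provider it is bound to
    parts: Dict[str, List[int]] = {}  # provider or "processor" -> line numbers
    for number, text in query.items():
        binding = BINDING.match(text)
        if binding:
            variable, provider = binding.group(1), binding.group(2)
            owner[variable] = provider
            target = provider
        else:
            condition = CONDITION.match(text)
            target = owner.get(condition.group(1), "processor") if condition else "processor"
        parts.setdefault(target, []).append(number)
    return parts
```

Applied to `QUERY`, the function assigns lines 2 and 6 to “local”, lines 3, 7, 8 and 9 to “blast”, and lines 1, 4, 5 and 10 to the processor, matching the division described in the text.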

Processor 34 then determines (62) the cross-dependence of the parts of the query, i.e., which parts require data from other parts and therefore must receive the data from the other parts before they are performed. In Table 2, it is determined from line 7 that the query part directed to homology engine 26 requires output from another query part.
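Once the cross-dependence of the parts is known, the rounds in which they can be sent out follow from a topological ordering. The Python sketch below uses the standard-library `graphlib` module (Python 3.9 or later); the dependency table is written by hand from the query of Table 2 and the function name `rounds` is hypothetical. The patent does not prescribe this particular algorithm.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# part -> parts whose results it needs. Line 7 (bc.querySeq = r.sequence)
# makes the "blast" part depend on the "local" part; the processor's own
# lines need results from both.
DEPENDS_ON: Dict[str, Set[str]] = {
    "local": set(),
    "blast": {"local"},
    "processor": {"local", "blast"},
}


def rounds(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group query parts into rounds; each round needs only earlier rounds."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    ordered: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # parts whose inputs are all available
        ordered.append(ready)
        sorter.done(*ready)
    return ordered
```

For the dependencies above, the first round contains only the “local” part, the second the “blast” part, and the processor's remaining operations come last.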

Thereafter, processor 34 sends (64) to OPM translation servers 38 and/or 42 a first round of query parts belonging to their respective service providers 24 and 26. The query parts sent in the first round are those which do not require results from other queries. In Table 2, the part relating to variable r, i.e., lines 2 and 6, is sent to the OPM server 38 of database “local”. These lines designate a query for all the Fragment objects in the database which have a value “today” in their “finished” field. The OPM server translates (66) the received query part into a language recognized by database “local”. The translated query part is passed to the database 24, which processes (68) the query and returns (70) the results of the query to the respective OPM server 38. The OPM server 38 translates (72) the results received from the database 24 into the OPM result format and passes the translated results to processor 34.

If (74) the query includes additional query parts which have not yet been performed, e.g., query parts dependent on results from other queries, steps 64, 66, 68, 70 and 72 are repeated for the additional query parts. In the example of Table 2, the query part formed of lines 3, 7, 8 and 9 is passed to the translation server 42 of homology engine 26. The translation server 42 translates (66) the query part into commands performed by homology engine 26. For each sequence of variable r in the output of database “local”, translation server 42 sends a “blastn” command to engine 26 to perform a homology comparison between the sequence and the database “dbEST”. The results received from engine 26 are summarized (72) by translation server 42 in the “output” field of the “Blast_Call” object.

In some embodiments of the present invention, system 20 begins a second round of processing query parts before the first round, on which the second round depends, is finished. That is, as the first round provides records as results, the second round can already manipulate them.
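One way to picture such overlapping rounds is with Python generators, in which the consuming stage handles each record as soon as the producing stage yields it. This is only a sketch of the idea; the function names, the record contents and the log are hypothetical, and no real database or homology engine is contacted.

```python
def local_fragments(log):
    """Stand-in for the first round: yields records one at a time."""
    for frag_id, sequence in [("f1", "ACGT"), ("f2", "GGCC")]:
        log.append(f"local yields {frag_id}")
        yield {"fragId": frag_id, "sequence": sequence}


def homology_stage(records, log):
    """Stand-in for the second round: handles each record as it arrives."""
    for record in records:
        log.append(f"homology handles {record['fragId']}")
        yield {"fragId": record["fragId"], "query": record["sequence"]}


LOG = []
RESULTS = list(homology_stage(local_fragments(LOG), LOG))
```

After the run, LOG shows the two stages interleaved: the second stage handles record f1 before the first stage has produced record f2.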

Once all the query parts have been handled by their respective service providers 24 and 26, processor 34 performs (76) any remaining operations in the queries and provides (78) the user with the results required in the SELECT section of the query. In the example of table 2, processor 34 performs the comparison in line 10 of the query. Variable h refers to the field “sequence” of the sub-object “summary” of the object “output”, which represents the results from the blast comparison. Sequences having a length greater than 300 are selected from the blast results. The user is then provided with the value of the “accessor” field of the variable h and with the value of the “fragId” field of the variable r, for all the objects which fulfill the query.
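The final step performed by processor 34 in this example can be sketched as a filter followed by a projection. The nested dictionaries below are a hypothetical rendering of the structured results, and the sample values are invented for illustration; only the field names “fragId”, “sequence”, “output”, “summary”, “accessor” and the length threshold of 300 come from the description above.

```python
def final_select(fragments, blast_calls, min_length=300):
    """Keep blast hits longer than min_length and project (fragId, accessor)."""
    by_sequence = {f["sequence"]: f for f in fragments}
    rows = []
    for call in blast_calls:
        r = by_sequence.get(call["query"])  # join b.query = r.sequence
        if r is None:
            continue
        for h in call["output"]["summary"]["sequence"]:
            if h["length"] > min_length:  # remaining condition on h
                rows.append((r["fragId"], h["accessor"]))
    return rows


# Invented sample data standing in for the results of the two query parts.
FRAGMENTS = [
    {"fragId": "frag_1", "sequence": "ACGT"},
    {"fragId": "frag_2", "sequence": "TTGA"},
]
BLAST_CALLS = [
    {"query": "ACGT", "output": {"summary": {"sequence": [
        {"accessor": "EST_A", "length": 450},
        {"accessor": "EST_B", "length": 120},
    ]}}},
    {"query": "TTGA", "output": {"summary": {"sequence": [
        {"accessor": "EST_C", "length": 301},
    ]}}},
]
ROWS = final_select(FRAGMENTS, BLAST_CALLS)
```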

The above description has focused on BLAST as a homology method; however, other types of homology servers may also be used, for example BLASTX, BLASTN and BLASTP. Additionally, other types of data manipulation may be provided, for example, error correction, in which a sequence is corrected for various types of errors. Another type of data manipulation server is, for example, a server which guesses a tertiary structure of a protein from its sequence, for example the number of alpha helices or the protein's affinity to a certain DNA sequence. Alternatively to guessing the structure, the server may provide a grading facility which grades a list of provided sequences for affinity to the protein (or for similarity of their derived protein) or which selects those sequences which have a certain affinity.

As can be appreciated, some of these data manipulation servers require only one input record set while others require more than one input record set. For example, a homology search can compare a first set of records against records in a second database (fixed value) or against a second set of provided records. In some cases, three or more inputs may be provided, for example where a third record set includes a list of rules which apply when comparing the two record sets. In some cases, all the record sets need to be fully specified before the manipulation can be performed. In other cases, only one or possibly not even one of the record sets needs to be fully specified before starting the manipulation. The considerations for optimizing and performing in parallel can be applied to the availability of record sets as well. In some embodiments of the invention, the definitions of how the data manipulation server operates in the absence of data and/or the relative computation time for different tasks are stored in directory 40, optionally along with other information useful for optimizing queries which include data manipulation.

An advantage of some of the above embodiments is that it is possible to use substantially any tool developed for manipulation of databases to access data manipulation servers. For example, graphic interface 28 may be an interface developed solely for preparing queries for database servers, as described, for example, in Kosky, A. S., Chen, I. A., Markowitz, V. M., and Szeto, E. “Exploring Heterogeneous Biological Databases: Tools and Applications”, Proceedings of the 6th International Conference on Extending Database Technology (EDBT'98), Lecture Notes in Computer Science, Vol. 1377, Springer-Verlag, 1998, pp. 499-513, the disclosure of which is incorporated herein by reference. A user may use this interface to prepare sophisticated queries which include access to data manipulation servers, such as homology search engines.

Likewise, optimization tools designed for database queries may be applied, in accordance with the above embodiments, to queries which include reference to data manipulation servers. Such optimization is especially important for queries which reference data manipulation servers because usually these servers require substantially more processing time than databases.

Furthermore, the results of the queries are optionally provided in a single common format which allows use of a single standard output interface to display the results.

In addition, variables representing database and pseudo database objects may be linked together using methods for linking databases described, for example, in the EDBT'98 publication referenced hereinabove. These linking methods allow simpler statement of queries and hence more transparency to the user, who does not need to know the structure of the various servers used.

Although the above described embodiments refer to queries which relate to data manipulation servers as if they were databases, some embodiments of the invention relate to queries which include commands to be performed by data manipulation servers, not necessarily in the same manner in which databases are searched. For example, a query may include an explicit command to be carried out by a data manipulation server, e.g., homology engine 26. Such commands are referred to herein as application specific data type (ASDT) commands.

TABLE 3
(1) SELECT l = r.fragId, a = h.accessor
(2) FROM r in local:Fragments
(3) b in blast:Output
(4) h = bo.summary.sequence
(5) WHERE r.finished = “today” and
(6) r.sequence.blast(“dbEST”) and
(7) b.query = r.sequence and
(8) h.length > 300

Table 3 shows a query similar to the query of table 2 in which homology engine 26 is activated using explicit commands written in a format acceptable by OPM processor 34. Line 6 in table 3 is a command to perform “blast” on the “sequence” fields of the possible values of variable r. The blast is performed against a database “dbEST”. The results from performing the blast command appear in a variable b which is defined in line 3 of table 3.

In an embodiment of the present invention, when processor 34 encounters an ASDT command, such as the “blast” command on line 6, it first checks with the database involved, i.e., the “local” database, whether the database supports the command in the specific syntax. Then, processor 34 consults directory 40 to determine a server which has the routine invoked by the command. Processor 34 passes the ASDT command, with whatever data objects to which the command relates, directly to the determined server. Alternatively, the command is passed through translation server 42. The output from the server is optionally passed to processor 34 in a structured form, as described above, so as to allow easy manipulation of the results. In this embodiment, processor 34 does not model homology engine 26 as a database 24, but does access the homology engine from within a complex query which accesses databases.
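The handling of an ASDT command may be sketched as a support check followed by a directory lookup. Everything in the Python block below is hypothetical: the directory is modeled as a plain mapping from command names to routines, fake_blast is a stand-in that performs no homology search, and leaving a natively supported command to the database itself is an assumption, since the description does not state what follows the support check.

```python
def fake_blast(sequence, database):
    """Stand-in routine; a real server would run a homology search."""
    return {"query": sequence, "database": database, "hits": []}


# Hypothetical directory: command name -> routine on some server.
DIRECTORY = {"blast": fake_blast}

# Hypothetical record of the commands each database supports natively.
NATIVE_COMMANDS = {"local": set()}


def run_asdt(command, source_db, *args):
    """Check the source database first, then fall back to the directory."""
    if command in NATIVE_COMMANDS.get(source_db, set()):
        # Assumption: a natively supported command is left to the database.
        return ("native", source_db)
    routine = DIRECTORY.get(command)
    if routine is None:
        raise LookupError(f"no server provides command {command!r}")
    # Pass the command's data objects directly to the determined routine.
    return ("server", routine(*args))


OUTCOME = run_asdt("blast", "local", "ACGT", "dbEST")
```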

The ASDT commands do not necessarily appear in the WHERE section of the query. Table 4 shows a query in which a command appears in the SELECT section of the query. The command is processed after the query is evaluated, at a stage of presenting the results of the query.

TABLE 4
(1) SELECT x.gelId
(2) x.image.crop(0,0,200,400).display()
(3) FROM x in Gel
(4) WHERE x.gelId = “gel_000111”

In table 4, the “image” field of each value of variable x which satisfies the query is passed to a routine “crop”, which returns a piece of an image having specified coordinates. The results from the routine “crop” are passed to a routine “display” which displays the result in any desired manner.
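The chaining of routines in line 2 of table 4 can be imitated in Python with methods that each return an object the next call operates on. The Image class below is a toy stand-in that stores rows of characters. Reading the four arguments of “crop” as left, top, right and bottom bounds is an assumption, as is having “display” return text rather than draw anything.

```python
class Image:
    """Toy image made of text rows; each character stands for one pixel."""

    def __init__(self, rows):
        self.rows = list(rows)

    def crop(self, left, top, right, bottom):
        # Assumed meaning of the arguments: half-open bounds in pixels.
        return Image(row[left:right] for row in self.rows[top:bottom])

    def display(self):
        # Rendered as text here; a real routine could display it any way.
        return "\n".join(self.rows)


GEL_IMAGE = Image(["abcd", "efgh", "ijkl"])
SHOWN = GEL_IMAGE.crop(1, 0, 3, 2).display()
```

Because “crop” returns another Image, the call to “display” applies to the cropped piece only, which mirrors the order of evaluation described for table 4.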

The routines referenced by the ASDT commands may be evaluated by a data manipulation server as described above with reference to the blast command evaluated by homology engine 26. Alternatively or additionally, some routines may be situated within processor 34 or in directory 40. Stating the commands within a query, rather than invoking the commands on the results received from a query, is simpler for the user. In addition, invoking the commands from within the query allows performing the command before the results are passed to end-user 22. In many cases this conserves substantial communication resources.

In some cases, users accessing databases are frequently interested in attributes which may be extracted from the image of a complex data field, for example, a gel. Such attributes include, for example, the length of an image of the gel, its average intensity or specific lanes of the image. Therefore, some databases have redundant data fields which hold values for these attributes. By using ASDT commands, these redundant fields are not needed. The routines invoked by the ASDT commands may be stored in the database 24, on a separate data manipulation server, in directory 40 and/or in processor 34.

It is noted that the ASDT commands may be invoked implicitly as described above with reference to FIG. 2. In some embodiments of the invention, for each command, a command data object is defined which includes input and output fields of the command. An access to an output field of the object is translated by system 20 as an implicit invocation of the command.
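Implicit invocation through a command data object can be imitated in Python with a property whose first access runs the command. The class below is a sketch under stated assumptions: the name BlastCall echoes the “Blast_Call” object mentioned earlier, the engine is any callable supplied by the caller, and caching the output after the first access is a choice made for this example rather than something the description specifies.

```python
class BlastCall:
    """Hypothetical command data object with input fields and one output."""

    def __init__(self, query, database, engine):
        self.query = query          # input field
        self.database = database    # input field
        self._engine = engine       # callable standing in for the server
        self._output = None
        self.invocations = 0

    @property
    def output(self):
        # Reading the output field implicitly invokes the command once.
        if self._output is None:
            self.invocations += 1
            self._output = self._engine(self.query, self.database)
        return self._output


def fake_engine(query, database):
    """Stand-in engine; no real homology search is performed."""
    return {"summary": f"{query} compared against {database}"}


CALL = BlastCall("ACGT", "dbEST", fake_engine)
BEFORE = CALL.invocations
FIRST = CALL.output
AFTER = CALL.invocations
```

No command runs when the object is created; the engine is called only when the output field is first read.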

It will be appreciated that the above described methods may be varied in many ways, including changing the order of steps and the exact implementation used. It should also be appreciated that the above description of methods and apparatus is to be interpreted as including apparatus for carrying out the methods and methods of using the apparatus. In particular, the above methods should be interpreted to describe software for carrying out a complete method as described above, a part thereof, or software which modifies existing software to perform as described above. In addition, the scope of the invention includes such software stored on a computer readable medium, such as a disk, stored in a memory or executing on a computer.

The present invention has been described using non-limiting detailed descriptions of embodiments thereof that are provided by way of example and are not intended to limit the scope of the invention. It should be understood that features and/or steps described with respect to one embodiment may be used with other embodiments and that not all embodiments of the invention have all of the features and/or steps shown in a particular figure or described with respect to one of the embodiments. Variations of embodiments described will occur to persons of the art.

It is noted that some of the above described embodiments describe the best mode contemplated by the inventors and therefore include structure, acts or details of structures and acts that may not be essential to the invention and which are described as examples. Structure and acts described herein are replaceable by equivalents which perform the same function, even if the structure or acts are different, as known in the art. Therefore, the scope of the invention is limited only by the elements and limitations as used in the claims. When used in the following claims, the terms “comprise”, “include”, “have” and their conjugates mean “including but not limited to”.

1. A multi-database query system for querying a plurality of biological databases containing biological data, comprising: an input which receives a query in a structured form; a processor for receiving the query and dividing the query into a plurality of query parts, wherein the plurality of query parts corresponds to at least one database of the plurality of biological databases and at least one condition statement; and at least one translation server which translates at least one of the plurality of query parts into commands recognized by a data manipulation server associated with a biological database of the plurality of biological databases and returns results of the query parts to the processor; wherein the processor determines whether the query includes unprocessed parts and, if the query has unprocessed parts, sends at least one unprocessed part to the at least one translation server, repeating the process until all unprocessed parts are processed, and wherein the processor further applies one or more conditions within the at least one condition statement to the processed query and generates a user output meeting the one or more conditions.
2. The system according to claim 1, wherein the translation server models results from the data manipulation server into database objects.
3. The system according to claim 1, wherein the data manipulation server comprises a server that receives input from at least two different sources.
4. The system according to claim 1, further comprising a directory in communication with the processor, wherein the processor refers to the directory to determine how to divide the query.
5. The system according to claim 1, wherein the at least one translation server comprises at least two translation servers associated with at least one local biological database and at least one remote database.
6. The system according to claim 5, wherein the query parts are sent to the at least two translation servers in parallel.
7. The system according to claim 1, wherein the data manipulation server is a homology search engine.
8. The system according to claim 7, wherein the homology search engine is BLAST.
9. The system according to claim 1, wherein the processor further determines cross-dependence of the query parts and, if cross-dependence is found, further divides the query parts into independent parts and dependent parts.
10. The system according to claim 9, wherein the query parts are cross-dependent and are sent to at least two translation servers sequentially so that independent parts are sent before dependent parts.
11. A method of querying a multi-database query system having a plurality of biological databases containing biological data, comprising: (a) inputting a query in a structured form; (b) receiving the query in a processor and dividing the query into a plurality of query parts, wherein the plurality of query parts corresponds to at least one database of the plurality of biological databases and at least one condition statement; (c) using at least one translation server, translating at least one of the plurality of query parts into commands recognized by a data manipulation server associated with a biological database of the plurality of biological databases and returning results of the query parts to the processor; (d) determining whether the query includes unprocessed parts and, if the query has unprocessed parts, sending at least one unprocessed part to the at least one translation server; (e) repeating steps (c) and (d) until all unprocessed parts of the query are processed; (f) applying one or more conditions within the at least one condition statement to the processed query; and (g) generating a user output meeting the one or more conditions.
12. The method according to claim 11, wherein the at least one translation server models results from the data manipulation server into database objects.
13. The method according to claim 11, wherein the data manipulation server comprises a server that receives input from at least two different sources.
14. The method according to claim 11, further comprising consulting a directory in communication with the processor to determine how to divide the query.
15. The method according to claim 11, wherein the at least one translation server comprises at least two translation servers associated with at least one local biological database and at least one remote database.
16. The method according to claim 11, wherein the data manipulation server is a homology search engine.
17. The method according to claim 16, wherein the homology search engine is BLAST.
18. The method according to claim 11, wherein the query parts are sent to at least two translation servers in parallel.
19. The method according to claim 11, further comprising determining cross-dependence of the query parts and, if cross-dependence is found, dividing the query parts into independent parts and dependent parts.
20. The method according to claim 11, wherein the query parts are cross-dependent and further comprising sending the query parts to at least two translation servers sequentially, with independent parts sent before dependent parts.